H2Opred: a robust and efficient hybrid deep learning model for predicting 2’-O-methylation sites in human RNA

Abstract 2’-O-methylation (2OM) is the most common post-transcriptional modification of RNA. It plays a crucial role in RNA splicing, RNA stability and innate immunity. Despite advances in high-throughput detection, the chemical stability of 2OM makes it difficult to detect and map in messenger RNA. Therefore, bioinformatics tools have been developed using machine learning (ML) algorithms to identify 2OM sites. These tools have made significant progress, but their performance remains unsatisfactory and needs further improvement. In this study, we introduced H2Opred, a novel hybrid deep learning (HDL) model for accurately identifying 2OM sites in human RNA. Notably, this is the first application of HDL in developing four nucleotide-specific models [adenine (A2OM), cytosine (C2OM), guanine (G2OM) and uracil (U2OM)] as well as a generic model (N2OM). H2Opred incorporated both stacked 1D convolutional neural network (1D-CNN) blocks and stacked attention-based bidirectional gated recurrent unit (Bi-GRU-Att) blocks. The 1D-CNN blocks learned effective feature representations from 14 conventional descriptors, while the Bi-GRU-Att blocks learned feature representations from five natural language processing (NLP)-based embeddings extracted from RNA sequences. H2Opred integrated these feature representations to make the final prediction. Rigorous cross-validation analysis demonstrated that H2Opred consistently outperforms conventional ML-based single-feature models on five different datasets. Moreover, the generic model of H2Opred demonstrated a remarkable performance on both training and testing datasets, significantly outperforming the existing predictor and the other four nucleotide-specific H2Opred models. To enhance accessibility and usability, we have deployed a user-friendly web server for H2Opred, accessible at https://balalab-skku.org/H2Opred/. This platform will serve as an invaluable tool for accurately predicting 2OM sites within human RNA, thereby facilitating broader applications in relevant research endeavors.

It is crucial to note that the sum of the frequencies of the nucleic acids within a given sliding window must equal 1. For example, if the sliding window (w = 5) comprises an RNA subsequence such as [A, C, C, U, U], the resulting frequencies are [0.2, 0.4, 0, 0.4], representing the occurrence percentages of the nucleic acid types A, C, G, and U within the given sliding window, respectively.

PS2: PS2 denotes pairs of adjacent nucleotides in a pairwise fashion, encompassing the combinations AA, AC, AG, AU, CA, ..., UU, totaling 16 distinct pairs [1]. Each of these 16 pairs is encoded into 16 binary bits, each falling within the set {0, 1}. For example, AG is represented as (0010000000000000), while GC is denoted as (0000000001000000). Consequently, an RNA sequence such as AGGU is encoded as (001000000000000000000000001000000000000000010000).
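To make the encoding concrete, here is a minimal Python sketch of PS2, assuming the alphabetical pair ordering (AA, AC, ..., UU) inferred from the AG and GC examples above:

```python
from itertools import product

# All 16 dinucleotide pairs in alphabetical order: AA, AC, AG, ..., UU.
PAIRS = ["".join(p) for p in product("ACGU", repeat=2)]

def ps2_encode(seq: str) -> list[int]:
    """Encode each adjacent nucleotide pair as a 16-bit one-hot block."""
    bits = []
    for i in range(len(seq) - 1):
        block = [0] * 16
        block[PAIRS.index(seq[i:i + 2])] = 1
        bits.extend(block)
    return bits

# "AGGU" -> pairs AG, GG, GU -> 48 bits, matching the example above.
print("".join(map(str, ps2_encode("AGGU"))))
```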
CKSNAP: CKSNAP integrates the concepts of NAC and PS2 within the framework of nucleic acid pairs. Specifically, CKSNAP computes the frequencies of 16 nucleic acid pairs in which the two nucleotides are separated by k spacer nucleotides, with k ∈ {0, 1, 2, 3, 4, 5}. These 16 pairs align with those in PS2, encompassing the combinations AA, AC, AG, ..., UU. Notably, when k = 0, the 16 pairs in CKSNAP are generated in the same way as those in PS2. However, unlike PS2, CKSNAP calculates frequencies for these 16 pairs rather than encoding them into binary bits. In practical terms, for an RNA sequence of length $L$, the feature vector for each k-spaced scenario can be defined as follows:

$$V_k = \left(f(p_1), f(p_2), \ldots, f(p_{16})\right),$$

where $f(p_i)$ represents the frequency of the paired nucleic acid $p_i$, which can be computed as follows:

$$f(p_i) = \frac{\Sigma(p_i)}{N_{\mathrm{total}}}, \qquad (3)$$

with $p_i \in \{AA, AC, AG, \ldots, UU\}$, where $i = 1, 2, 3, \ldots, n$ with $n = 16$ representing the number of pairs, $\Sigma(p_i)$ denoting the total count of the paired nucleic acid $p_i$ in the given sequence, and $N_{\mathrm{total}} = L - k - 1$ denoting the number of k-spaced pairs in the sequence.
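A small sketch of the CKSNAP computation under the same assumptions (alphabetical pair ordering; $N_{\mathrm{total}} = L - k - 1$ as reconstructed above):

```python
from itertools import product

PAIRS = ["".join(p) for p in product("ACGU", repeat=2)]

def cksnap(seq: str, k_max: int = 5) -> list[float]:
    """Frequencies of the 16 k-spaced pairs for k = 0..k_max (96D in total)."""
    features = []
    for k in range(k_max + 1):
        counts = {p: 0 for p in PAIRS}
        n_total = len(seq) - k - 1          # number of k-spaced pairs
        for i in range(n_total):
            counts[seq[i] + seq[i + k + 1]] += 1
        features.extend(counts[p] / n_total for p in PAIRS)
    return features

print(len(cksnap("AGGUACGUACGUACGUAGGUCCAUGGAUCCAUGGAUCCAUG")))  # 96
```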

PseEIIP:
The electron-ion interaction pseudopotential (EIIP) method was originally utilized to encode DNA or protein sequences. This approach involves evaluating the energy of delocalized electrons within nucleotide or amino acid sequences. When applied to RNA sequences, specific EIIP values are assigned: A is encoded with a value of 0.1260, C with 0.1340, G with 0.0806, and U (equivalent to T) with a value of 0.1335, as reported in [2].
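A minimal sketch of the EIIP value assignment described above (only the per-nucleotide mapping; any further pseudo-composition step of PseEIIP is omitted here):

```python
# EIIP values reported for the four ribonucleotides (U shares T's value).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "U": 0.1335}

def eiip_encode(seq: str) -> list[float]:
    """Map each nucleotide of an RNA sequence to its EIIP value."""
    return [EIIP[nt] for nt in seq]

print(eiip_encode("ACGU"))  # [0.126, 0.134, 0.0806, 0.1335]
```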
However, some properties are unique to specific dinucleotides, such as keto (GU) content, adenine content, purine (AG) content, guanine content, GC content, cytosine content, and thymine content. It is important to note that the default value for these properties is zero for all dinucleotides, except for the specific ones mentioned.
The DPCP_1 can be defined as:

$$\mathrm{DPCP\_1} = \left[f(d_j) \times PCP_i(d_j)\right], \qquad (7)$$

where $d_j = XY$, and $X$ and $Y$ are nucleotide symbols representing adenine (A), cytosine (C), guanine (G), or uracil (U). The variable $i$ ranges from 1 to $N_{pcp} = 22$, indicating the total number of physicochemical properties. Similarly, $j$ ranges from 1 to $N_{dn} = 16$, representing the number of dinucleotides. $f(d_j)$ signifies the normalized frequency of the dinucleotide $d_j$, and $PCP_i$ denotes the i-th physicochemical property. Consequently, the output of DPCP_1 is a 352D vector (16 × 22), representing the various dinucleotides and physicochemical properties, respectively.
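The sketch below illustrates the DPCP_1 computation; the two property rows ("shift", "slide") and their values are hypothetical placeholders standing in for the 22-property table, which is not reproduced here:

```python
from itertools import product

DINUCS = ["".join(p) for p in product("ACGU", repeat=2)]

# Placeholder table: the real descriptor uses 22 physicochemical properties
# per dinucleotide; the two rows below carry illustrative values only.
PCP = {
    "shift": {d: 0.5 for d in DINUCS},
    "slide": {d: -1.2 for d in DINUCS},
}

def dpcp_1(seq: str) -> list[float]:
    """f(XY) * PCP_i(XY) for every dinucleotide and property (here 16 x 2)."""
    n_total = len(seq) - 1
    freq = {d: 0 for d in DINUCS}
    for i in range(n_total):
        freq[seq[i:i + 2]] += 1
    return [freq[d] / n_total * props[d]
            for props in PCP.values() for d in DINUCS]

print(len(dpcp_1("AGGUCCAU")))  # 32 with the 2 placeholder properties
```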
BPF: BPF, commonly referred to as one-hot encoding, is a popular technique for encoding various biological sequences such as DNA, RNA, peptides, and proteins. Each nucleotide is represented by a 4D binary vector: A (1000), C (0100), G (0010), and U (0001). Consequently, for an RNA sequence of length $L$, the resulting vector is flattened, possessing dimensions of $4 \times L$.
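A minimal one-hot (BPF) sketch:

```python
import numpy as np

ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
           "G": [0, 0, 1, 0], "U": [0, 0, 0, 1]}

def bpf_encode(seq: str) -> np.ndarray:
    """Flattened binary profile vector of length 4 * L."""
    return np.array([bit for nt in seq for bit in ONE_HOT[nt]])

print(bpf_encode("ACGU"))  # 16 bits for a 4-nt sequence
```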
LPSDF: LPSDF, a method based on dinucleotide composition, determines the frequency of the dinucleotide formed by the nucleotide at a specific position and the one at the preceding position within an RNA sequence. The occurrence frequency of this dinucleotide at the i-th position can be calculated as follows:

$$\mathrm{LPSDF}(i) = \frac{C(s_{i-1}s_i)}{|N|}, \qquad i = 2, 3, \ldots, L. \qquad (12)$$

Here, $C(s_{i-1}s_i)$ denotes the number of occurrences of the dinucleotide $s_{i-1}s_i$ within the subsequence $\{s_1, s_2, \ldots, s_i\}$, and $|N| = i - 1$ represents the number of dinucleotides in that subsequence. In equation (12), $L$ denotes the length of the provided RNA sequence, and $s_i, s_j$ belong to the set $\{A, C, G, U\}$.
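A short sketch of LPSDF, assuming the prefix normalization $|N| = i - 1$ reconstructed in equation (12):

```python
def lpsdf(seq: str) -> list[float]:
    """Frequency of each position's dinucleotide within the prefix ending there."""
    feats = []
    for i in range(1, len(seq)):                       # 0-based i = position - 1
        dinuc = seq[i - 1:i + 1]
        prefix_pairs = [seq[j:j + 2] for j in range(i)]  # the |N| = i pairs so far
        feats.append(prefix_pairs.count(dinuc) / i)
    return feats

print(lpsdf("AAAU"))  # [1.0, 1.0, 0.333...]
```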

MMNF:
The MMNF is formed through the integration of multivariate mutual information (MMI) and accumulated nucleotide frequency (ANF). The procedure for deriving MMI and ANF is outlined as follows. To calculate the MMI encoding [7], we employ the frequencies of k-mers with $k \in \{2, 3\}$. Subsequently, the mutual information corresponding to these frequencies can be computed as follows: for $k = 2$,

$$I(x, y) = f(x, y)\ln\!\left(\frac{f(x, y)}{f(x)\,f(y)}\right),$$

and for $k = 3$,

$$I(x, y, z) = I(x, y) - I(x, y \mid z),$$

where $f(\cdot)$ denotes the corresponding k-mer frequency and $I(x, y \mid z)$ is the conditional mutual information of $x$ and $y$ given $z$. ANF, in turn, captures the density of each nucleotide at each position: the value at position $i$ is the frequency of the nucleotide $s_i$ within the subsequence $\{s_1, \ldots, s_i\}$, i.e., $\mathrm{ANF}(i) = C(s_i)/i$.
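The sketch below illustrates the two MMNF ingredients, ANF and the 2-mer MMI term (the 3-mer extension is omitted):

```python
import math
from collections import Counter

def anf(seq: str) -> list[float]:
    """Accumulated nucleotide frequency: density of s_i within s_1..s_i."""
    return [seq[:i + 1].count(seq[i]) / (i + 1) for i in range(len(seq))]

def mmi_2mer(seq: str) -> dict[str, float]:
    """2-mer mutual information I(x, y) = f(xy) * ln(f(xy) / (f(x) f(y)))."""
    n1 = Counter(seq)
    n2 = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    f1 = {b: c / len(seq) for b, c in n1.items()}
    f2 = {d: c / (len(seq) - 1) for d, c in n2.items()}
    return {d: f * math.log(f / (f1[d[0]] * f1[d[1]]))
            for d, f in f2.items()}

print(anf("ACAU"))   # [1.0, 0.5, 0.666..., 0.25]
print(mmi_2mer("ACAUAC"))
```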

DBE:
The DBE descriptor captures positional information regarding each dinucleotide in the sequence. There are 16 possible dinucleotides, and each is represented by a 4D 0/1 vector within the descriptor. For instance, AA is represented as (0,0,0,0), AU as (0,0,1,1), CC as (0,1,0,1), and so on, with UU represented as (1,1,1,1). Consequently, DBE produces a 160D (40 × 4) 0/1 vector for the given sequence.

NCP: The NCP descriptor characterizes each nucleotide through three categories of chemical attributes: ring structure (comprising purines A, G and pyrimidines C, U), hydrogen bond (strong in C, G and weak in A, U), and functional group (including amino A, C and keto G, U). Each property has two classifications, signifying specific nucleotide types distinguished by their unique chemical properties.

ASLPN: The ASLPN method combines two techniques: adaptive skipped dinucleotide composition (ASDC) and local position-specific dinucleotide frequency (LPSDF). ASDC encoding is an advanced version of dinucleotide composition that integrates the k-skip-n-gram model. Specifically, it utilizes k-skipped nucleotides during n-gram model computation, incorporating both distance and composition information. Because the feature vector dimension grows exponentially with higher n-gram orders, this method focuses on n = 2 (dinucleotide) analysis. For an RNA sequence of length $L$, the output feature vector of ASDC is represented as follows:

$$\mathrm{ASDC} = (f_1, f_2, \ldots, f_{16}), \qquad f_i = \frac{O_i}{\sum_{i=1}^{16} O_i},$$

where $O_i$ denotes the total number of occurrences of the $i$-th dinucleotide formed by any two nucleotides separated by 0 to $L - 2$ intervening positions.

We extracted several NLP-based embeddings from RNA sequences to enhance the richness of the feature representation for predicting 2'-O-methylation (2OM) sites. These embeddings, namely DNABERT, Seq2Vec, Word2Vec, FastText, and GloVe, capture rich embedding information while preserving the inherent sequence-order characteristics of RNA. Here is an overview of the different features of each NLP-based embedding.

DNABERT: DNABERT [8] is a pre-trained model tailored for processing DNA sequences. It uses a bidirectional encoder representation from the Transformer model, which allows it to learn relationships between nucleotides in both directions of a sequence. DNABERT employs a k-mer-based tokenizer to tokenize DNA sequences into discrete units. Subsequently, these tokens undergo processing through 12 Transformer blocks to derive comprehensive representations. While DNABERT was primarily developed for DNA sequences, its applicability to RNA sequences has been demonstrated in [8, 9]. Specifically, RNA sequences differ from DNA sequences by just one base (thymine to uracil), while the syntax and semantics largely remain unchanged. As a result, DNABERT can be used to process 2OM RNA sequences by simply replacing the nucleotide "U" with the nucleotide "T" before tokenization. Notably, when provided with an RNA sequence, DNABERT can generate a feature matrix that not only contains embeddings (768D) but also valuable temporal-sequential ordering information.
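As a minimal illustration of the tokenization step, the helper below converts an RNA sequence into overlapping 3-mers after the U-to-T replacement; the function name and the choice of k = 3 are assumptions for the sketch, not DNABERT's actual interface:

```python
def kmer_tokenize(rna_seq: str, k: int = 3) -> list[str]:
    """Convert an RNA sequence into overlapping k-mer tokens, DNABERT-style.

    U is replaced with T first, since DNABERT was trained on DNA.
    """
    dna = rna_seq.upper().replace("U", "T")
    return [dna[i:i + k] for i in range(len(dna) - k + 1)]

print(kmer_tokenize("AGGUCA"))  # ['AGG', 'GGT', 'GTC', 'TCA']
```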
Seq2Vec: Seq2Vec, a model designed for sequence-to-vector conversion, takes inspiration from ELMo (embeddings from language models). The ELMo framework encompasses two primary components: a pair of bidirectional long short-term memory (Bi-LSTM) layers and a character-aware CNN layer. The character-aware CNN layer distills insights from individual characters, thereby shaping the token representations. These token representations undergo further refinement through two successive Bi-LSTM layers, facilitating the acquisition of context-dependent embeddings. Notably, within the ELMo architecture, all layers are combined via a linear layer positioned on top, producing the ultimate representations. For a given RNA sequence, Seq2Vec generates a feature matrix of embeddings (1024D) containing crucial temporal-sequential ordering information.

Word2Vec: Word2Vec [10] is a language model (LM) conceptualized by Google, encompassing two complementary architectures: the continuous bag-of-words (CBOW) and continuous skip-gram models. Through exposure to extensive text corpora, Word2Vec maps words onto dense vector representations that intricately encode semantic relationships throughout the lexicon. Here, we harnessed Word2Vec to distill enriching embedding features from RNA sequences, resulting in a feature matrix comprising 512D embeddings.
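A toy sketch of training Word2Vec on k-mer "sentences" with gensim; the corpus and most hyperparameters are illustrative assumptions, with only vector_size = 512 following the text:

```python
from gensim.models import Word2Vec

# Treat each RNA sequence as a "sentence" of overlapping 3-mer "words".
corpus = [["AGG", "GGU", "GUC", "UCA"],
          ["CCU", "CUA", "UAG", "AGG"]]

model = Word2Vec(sentences=corpus, vector_size=512, window=5,
                 min_count=1, sg=0, epochs=10)   # sg=0 -> CBOW
vec = model.wv["AGG"]          # 512D embedding for the k-mer "AGG"
print(vec.shape)               # (512,)
```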
FastText: FastText [11] is a text classification model developed by the Facebook Research team in 2016. It is known for its simplicity and effectiveness, and is often used as a foundational model for various text classification tasks. FastText operates by incorporating n-gram features and using a hierarchical softmax loss to construct a linear classifier. This approach allows FastText to capture statistical associations between words and their contextual information, leading to its strong performance in text classification. Here, we leveraged FastText to generate numerical feature vectors from RNA sequences. These embeddings are numerical representations of words or phrases crafted to capture their semantic meaning. By employing the FastText embedder, we obtained a feature matrix consisting of 512D embeddings along with sequence-ordering information for the RNA sequences.
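The analogous FastText sketch with gensim; the character n-gram range (min_n, max_n) is an illustrative assumption:

```python
from gensim.models import FastText

corpus = [["AGG", "GGU", "GUC", "UCA"],
          ["CCU", "CUA", "UAG", "AGG"]]

# FastText additionally learns character n-grams inside each k-mer token.
model = FastText(sentences=corpus, vector_size=512, window=5,
                 min_count=1, min_n=2, max_n=3, epochs=10)
print(model.wv["AGG"].shape)   # (512,)
```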
GloVe: GloVe [12] is a word representation model that employs global word-word co-occurrence counts to capture statistical information about the relationships between words. Unlike models that rely solely on local context windows, GloVe harnesses the full corpus to learn significant substructures within the word vector domain. Here, GloVe was used to extract valuable features from RNA sequences, resulting in a feature matrix of 512D embeddings that captures both semantic and sequential information.
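GloVe is trained on a global co-occurrence matrix; the sketch below builds that input statistic for k-mer tokens with the 1/d distance weighting GloVe commonly uses (the tokenization and window size are assumptions, not the paper's pipeline):

```python
import numpy as np
from itertools import product

KMERS = ["".join(p) for p in product("ACGU", repeat=3)]
INDEX = {k: i for i, k in enumerate(KMERS)}

def cooccurrence(tokens: list[str], window: int = 5) -> np.ndarray:
    """Global k-mer co-occurrence counts, the statistic GloVe is trained on."""
    X = np.zeros((len(KMERS), len(KMERS)))
    for i, t in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                X[INDEX[t], INDEX[tokens[j]]] += 1.0 / abs(i - j)  # distance weighting
    return X

tokens = ["AGG", "GGU", "GUC", "UCA", "CAA"]
print(cooccurrence(tokens).sum())
```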

Figure S6. Comparison between the average performance of the nucleotide-specific H2Opred models and that of the generic H2Opred model on the training and independent testing datasets.

Figure S7. (A-B) Performance comparison among 154 conventional models based on the MCC on the generic training and testing datasets. (C-D) Performance comparison between the generic H2Opred model and the top five best conventional models on the generic training and testing datasets.

Figure S8. Performance comparison between the H2Opred models and the state-of-the-art predictors on the independent testing datasets.